Dyr og Data

Data pivoting with tidyr

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-09-04

Learning objectives

In this section of the data wrangling topic you will

  • Learn about the concept of tidy data

  • Learn how to convert data between wide and long formats using the tidyr 📦

Tidying data with tidyr

Tidy data

Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning). – Hadley Wickham

A way to arrange data that facilitates subsequent exploration, transformation, visualization, & modelling of the data where

  • each variable is a column; each column is a variable.
  • each observation is a row; each row is an observation.
  • each value is a cell; each cell is a single value.

Tidy data

How to represent data

Look at three different ways we could store data on Bovine tuberculosis (TB) incidents in cattle herds in Great Britain:

# A tibble: 1,017 × 4
   country date       n_herds n_cases
   <chr>   <date>       <dbl>   <dbl>
 1 England 1996-01-31   91024     355
 2 England 1996-02-29   90735     392
 3 England 1996-03-31   90512     440
 4 England 1996-04-30   90210     454
 5 England 1996-05-31   89357     434
 6 England 1996-06-30   88554     403
 7 England 1996-07-31   87802     387
 8 England 1996-08-31   87356     379
 9 England 1996-09-30   86879     313
10 England 1996-10-31   86065     274
# ℹ 1,007 more rows
# A tibble: 2,034 × 4
   country date       type    count
   <chr>   <date>     <chr>   <dbl>
 1 England 1996-01-31 n_herds 91024
 2 England 1996-01-31 n_cases   355
 3 England 1996-02-29 n_herds 90735
 4 England 1996-02-29 n_cases   392
 5 England 1996-03-31 n_herds 90512
 6 England 1996-03-31 n_cases   440
 7 England 1996-04-30 n_herds 90210
 8 England 1996-04-30 n_cases   454
 9 England 1996-05-31 n_herds 89357
10 England 1996-05-31 n_cases   434
# ℹ 2,024 more rows
# A tibble: 1,017 × 3
   country date       rate     
   <chr>   <date>     <chr>    
 1 England 1996-01-31 355/91024
 2 England 1996-02-29 392/90735
 3 England 1996-03-31 440/90512
 4 England 1996-04-30 454/90210
 5 England 1996-05-31 434/89357
 6 England 1996-06-30 403/88554
 7 England 1996-07-31 387/87802
 8 England 1996-08-31 379/87356
 9 England 1996-09-30 313/86879
10 England 1996-10-31 274/86065
# ℹ 1,007 more rows

Which is most useful?

How to represent data

The first version is most useful

bovine
# A tibble: 1,017 × 4
   country date       n_herds n_cases
   <chr>   <date>       <dbl>   <dbl>
 1 England 1996-01-31   91024     355
 2 England 1996-02-29   90735     392
 3 England 1996-03-31   90512     440
 4 England 1996-04-30   90210     454
 5 England 1996-05-31   89357     434
 6 England 1996-06-30   88554     403
 7 England 1996-07-31   87802     387
 8 England 1996-08-31   87356     379
 9 England 1996-09-30   86879     313
10 England 1996-10-31   86065     274
# ℹ 1,007 more rows

We have direct access to the variables of interest

Can easily wrangle the data

bovine <- bovine |>
  mutate(
    rate = (n_cases / n_herds) * 1000,
    year = format(date, "%Y")
  )
bovine
# A tibble: 1,017 × 6
   country date       n_herds n_cases  rate year 
   <chr>   <date>       <dbl>   <dbl> <dbl> <chr>
 1 England 1996-01-31   91024     355  3.90 1996 
 2 England 1996-02-29   90735     392  4.32 1996 
 3 England 1996-03-31   90512     440  4.86 1996 
 4 England 1996-04-30   90210     454  5.03 1996 
 5 England 1996-05-31   89357     434  4.86 1996 
 6 England 1996-06-30   88554     403  4.55 1996 
 7 England 1996-07-31   87802     387  4.41 1996 
 8 England 1996-08-31   87356     379  4.34 1996 
 9 England 1996-09-30   86879     313  3.60 1996 
10 England 1996-10-31   86065     274  3.18 1996 
# ℹ 1,007 more rows

Can easily wrangle the data

bovine |>
  group_by(country, year) |>
  summarise(total_cases = sum(n_cases))
# A tibble: 87 × 3
# Groups:   country [3]
   country year  total_cases
   <chr>   <chr>       <dbl>
 1 England 1996         4461
 2 England 1997         4921
 3 England 1998         6790
 4 England 1999         7668
 5 England 2000         8525
 6 England 2001         8318
 7 England 2002        17443
 8 England 2003        20466
 9 England 2004        19304
10 England 2005        22351
# ℹ 77 more rows

Wide or long?

Data can be in what we call wide format — where multiple observations are in a single row, or

Data can be in long format — where where each row corresponds to a single observation

We can usually make tidy data wider or longer to suit our needs

Wide or long?

Wide format

# A tibble: 3 × 30
# Groups:   country [3]
  country  `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004` `2005`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 England    4461   4921   6790   7668   8525   8318  17443  20466  19304  22351
2 Scotland    412    491    562    562    283    133    429    501    411    279
3 Wales      1805   1161   1216   1500   1563   2019   4230   5590   5184   6206
# ℹ 19 more variables: `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
#   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
#   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <dbl>,
#   `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>, `2024` <dbl>

You’ll often see data entered / stored this way

Wide format

# A tibble: 87 × 14
# Groups:   country, year [87]
   country year    Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct
   <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 England 1996    355   392   440   454   434   403   387   379   313   274
 2 England 1997    339   367   392   443   451   428   427   398   413   401
 3 England 1998    497   539   608   610   622   627   586   569   525   526
 4 England 1999    564   650   710   741   765   734   685   634   562   509
 5 England 2000    632   720   779   783   766   734   706   674   653   665
 6 England 2001    768   780   763   766   738   707   680   636   620   606
 7 England 2002    723   960  1182  1423  1561  1612  1647  1637  1630  1619
 8 England 2003   1808  1849  1991  1997  1915  1820  1707  1578  1501  1424
 9 England 2004   1497  1553  1700  1798  1796  1773  1688  1584  1471  1427
10 England 2005   1699  1913  2059  2098  2039  1946  1842  1753  1709  1719
# ℹ 77 more rows
# ℹ 2 more variables: Nov <dbl>, Dec <dbl>

You’ll often see data entered / stored this way

Long format

Data stored in databases are often stored in long (normalised forms)

# A tibble: 2,034 × 4
   country date       type    count
   <chr>   <date>     <chr>   <dbl>
 1 England 1996-01-31 n_herds 91024
 2 England 1996-01-31 n_cases   355
 3 England 1996-02-29 n_herds 90735
 4 England 1996-02-29 n_cases   392
 5 England 1996-03-31 n_herds 90512
 6 England 1996-03-31 n_cases   440
 7 England 1996-04-30 n_herds 90210
 8 England 1996-04-30 n_cases   454
 9 England 1996-05-31 n_herds 89357
10 England 1996-05-31 n_cases   434
# ℹ 2,024 more rows

Pivoting

Converting between a wide and long representations is called pivoting

gather() & spread()

An earlier implementation in tidyr 📦 used

  • gather()
  • spread()

roughly corresponding to

  • pivot_longer()
  • pivot_wider()

Many examples online and in 1st edition of r4ds used gather() and spread() but the pivot_*() functions are to be preferred now

pivot_longer()

How does pivoting data to a longer format work?

Say we have a data set with three dogs (A, B, and C) and we have taken their blood pressure twice each

dogs <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

If we want to plot or work with the BP measurements, we need these in a single column (variable)

pivot_longer()

dogs
# A tibble: 3 × 3
  id      bp1   bp2
  <chr> <dbl> <dbl>
1 A       100   120
2 B       140   115
3 C       120   125

Our aim is to have a data frame with three variables

  1. dog
  2. measurement
  3. value

We achieve this with pivot_longer()

pivot_longer()

dogs |> 
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

pivot_longer()

The id column is already a variable

It needs to be repeated once per column that we are pivoting

Here we are pivoting two columns, so we repeat each id twice

pivot_longer()

The column names from dogs:

  • bp1
  • bp2

become the values in a new column, whose name was give by the names_to argument: "measurement"

The original column names need to be repeated once per row of the original data

pivot_longer()

The cell values in the original data also become a new variable

The name of this new variable is given by the values_to argument: "value"

They are unwound, row by row

pivot_longer()

dogs |> 
  pivot_longer(
    cols = bp1:bp2,           # which columns are we pivoting?
    names_to = "measurement", # variable to create in long form
    values_to = "value"       # variable to hold the cell values
  )
# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

More complicated settings

Sometimes you will have data like this, with multiple variables per column

# A tibble: 3 × 679
  country  h_1996_Jan c_1996_Jan h_1996_Feb c_1996_Feb h_1996_Mar c_1996_Mar
  <chr>         <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 England       91024        355      90735        392      90512        440
2 Wales         21296        152      21256        164      21195        182
3 Scotland      20080         32      20034         31      19989         31
# ℹ 672 more variables: h_1996_Apr <dbl>, c_1996_Apr <dbl>, h_1996_May <dbl>,
#   c_1996_May <dbl>, h_1996_Jun <dbl>, c_1996_Jun <dbl>, h_1996_Jul <dbl>,
#   c_1996_Jul <dbl>, h_1996_Aug <dbl>, c_1996_Aug <dbl>, h_1996_Sep <dbl>,
#   c_1996_Sep <dbl>, h_1996_Oct <dbl>, c_1996_Oct <dbl>, h_1996_Nov <dbl>,
#   c_1996_Nov <dbl>, h_1996_Dec <dbl>, c_1996_Dec <dbl>, h_1997_Jan <dbl>,
#   c_1997_Jan <dbl>, h_1997_Feb <dbl>, c_1997_Feb <dbl>, h_1997_Mar <dbl>,
#   c_1997_Mar <dbl>, h_1997_Apr <dbl>, c_1997_Apr <dbl>, h_1997_May <dbl>, …

We have variables named h_1996_Jan and c_1996_Feb, where

  • h indicates if the data are number of herds
  • c indicates if the data are number of cases

pivot_longer()

bovine_long <- bovine_wide |> 
  pivot_longer(
    cols = !country,                       # which columns are we pivoting?
    names_to = c("type", "year", "month"), # variables to create in long form
    names_sep = "_",                       # split names at _
    values_to = "n"                        # variable to hold the cell values
  )
bovine_long
# A tibble: 2,034 × 5
   country type  year  month     n
   <chr>   <chr> <chr> <chr> <dbl>
 1 England h     1996  Jan   91024
 2 England c     1996  Jan     355
 3 England h     1996  Feb   90735
 4 England c     1996  Feb     392
 5 England h     1996  Mar   90512
 6 England c     1996  Mar     440
 7 England h     1996  Apr   90210
 8 England c     1996  Apr     454
 9 England h     1996  May   89357
10 England c     1996  May     434
# ℹ 2,024 more rows

pivot_longer()

Conceptually this is not too different from what we already saw

Instead of the columns names pivoting into a single column, they pivot to multiple columns

pivot_wider()

Having now separated the data into a long format we should see that there are multiple variables in the type column

bovine_long |>
  distinct(type)
# A tibble: 2 × 1
  type 
  <chr>
1 h    
2 c    

The observations are at the level of country, by year and month

But the data for each observation are spread over two rows

pivot_wider()

We can use pivot_wider() to pivot the data to a wider format to create n_herds and n_cases columns

bovine_long |>
  pivot_wider(
    id_cols = c(country, year, month), # which columns are we *not* pivoting
    names_from = type,                 # column with names of new variables
    values_from = n                    # column with values for new variables
  ) |>
  rename(n_herds = h, n_cases = c)     # rename for nicer names
# A tibble: 1,017 × 5
   country year  month n_herds n_cases
   <chr>   <chr> <chr>   <dbl>   <dbl>
 1 England 1996  Jan     91024     355
 2 England 1996  Feb     90735     392
 3 England 1996  Mar     90512     440
 4 England 1996  Apr     90210     454
 5 England 1996  May     89357     434
 6 England 1996  Jun     88554     403
 7 England 1996  Jul     87802     387
 8 England 1996  Aug     87356     379
 9 England 1996  Sep     86879     313
10 England 1996  Oct     86065     274
# ℹ 1,007 more rows

pivot_wider()

To see how pivot_wider() works, we’ll revisit the small dog blood pressure data set, but this time we have only two dogs

dogs2 <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

We want to create a data frame with column names taken from "measurement" and fill the cells with the values from "value"

pivot_wider()

We want to create a data frame with columns names taken from "measurement" and fill the cells with the values from "value"

dogs2 |> 
  pivot_wider(
    names_from = measurement, # which column(s) to take the names of new columns from
    values_from = value       # which column to take the values for the cells from
  )
# A tibble: 2 × 4
  id      bp1   bp2   bp3
  <chr> <dbl> <dbl> <dbl>
1 A       100   120   105
2 B       140   115    NA

How pivot_wider() works — I

First, pivot_wider() has to identify what the new columns will be

dogs2 |> 
  distinct(measurement) |> 
  pull()
[1] "bp1" "bp2" "bp3"

How pivot_wider() works — II

By default, the rows in the output are determined from those variables not going to new variables (names) or values

These are the id_cols and should uniquely identify a row in the new (wider) data frame

Here we only have 1 ID column

dogs2 |> 
  select(-measurement, -value) |> 
  distinct()
# A tibble: 2 × 1
  id   
  <chr>
1 A    
2 B    

How pivot_wider() works — III

pivot_wider() then uses these combinations to create an empty data frame

dogs2 |> 
  select(-measurement, -value) |> 
  distinct() |> 
  mutate(bp1 = NA, bp2 = NA, bp3 = NA)
# A tibble: 2 × 4
  id    bp1   bp2   bp3  
  <chr> <lgl> <lgl> <lgl>
1 A     NA    NA    NA   
2 B     NA    NA    NA   

Where bp1 etc were the distinct values of measurement

How pivot_wider() works — IV

pivot_wider() then fills in the new data frame with the relevant values

Not every dog had three BP measurements, so there is no value to fill in the data frame with

In that case the value is NA (missing)

dogs2 |> 
  pivot_wider(
    names_from = measurement, # which column(s) to take the names of new columns from
    values_from = value       # which column to take the values for the cells from
  )
# A tibble: 2 × 4
  id      bp1   bp2   bp3
  <chr> <dbl> <dbl> <dbl>
1 A       100   120   105
2 B       140   115    NA